
    An optimal parallel connectivity algorithm

    A synchronized parallel algorithm of depth O(n²/p) for p (≤ n²/log² n) processors is given for the problem of computing connected components of an undirected graph. The speed-up of this algorithm is optimal in the sense that the depth of the algorithm is of the order of the running time of the fastest known sequential algorithm divided by the number of processors used.
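
    The general flavor of such connectivity algorithms can be illustrated with a hook-and-shortcut routine. The Python sketch below is only a minimal illustration of that generic idea, not the algorithm of this paper; the parallel rounds are simulated sequentially, and all names are made up for the example.

```python
# Minimal sketch of hook-and-shortcut style parallel connectivity; NOT the
# paper's algorithm. Each iteration of the outer loop models one parallel
# round, executed here sequentially.
def connected_components(n, edges):
    """Return a component label for each of the n vertices."""
    parent = list(range(n))              # every vertex starts on its own

    def root(v):
        while parent[v] != v:
            v = parent[v]
        return v

    changed = True
    while changed:
        changed = False
        # "Hook": every edge tries to merge the two components it connects.
        for u, v in edges:
            ru, rv = root(u), root(v)
            if ru != rv:
                parent[max(ru, rv)] = min(ru, rv)
                changed = True
        # "Shortcut": pointer jumping keeps the component trees shallow.
        for v in range(n):
            parent[v] = parent[parent[v]]
    return [root(v) for v in range(n)]

# Two components, {0, 1, 2} and {3, 4}:
print(connected_components(5, [(0, 1), (1, 2), (3, 4)]))   # [0, 0, 0, 3, 3]
```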

    Empirical Challenge for NC Theory

    Horn-satisfiability, or Horn-SAT, is the problem of deciding whether a satisfying assignment exists for a Horn formula, a conjunction of clauses each with at most one positive literal (also known as Horn clauses). It is a well-known P-complete problem, which implies that unless P = NC, it is a hard problem to parallelize. In this paper, we empirically show that, under a known simple random model for generating the Horn formula, the ratio of hard-to-parallelize instances (those closer to the worst-case behavior) is infinitesimally small. We show that the depth of a parallel algorithm for Horn-SAT is polylogarithmic on average, for almost all instances, while keeping the work linear. This challenges theoreticians and programmers to look beyond worst-case analysis and come up with practical algorithms coupled with respective performance guarantees.
    Comment: 10 pages, 5 figures. Accepted at HOPC'2
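
    For concreteness, the round structure such a depth analysis refers to can be sketched with plain unit propagation for Horn formulas: each round sets, all at once, every variable currently forced to be True. The Python sketch below is an illustration only, not the paper's algorithm, and the clause encoding is an assumption made for the example.

```python
# Minimal sketch of unit propagation for Horn formulas; NOT the paper's
# algorithm. A clause (h OR NOT x1 OR ... OR NOT xk) is encoded as
# (h, {x1, ..., xk}) with h = None for purely negative clauses. Each pass of
# the while-loop models one parallel round.
def horn_sat(clauses):
    """Return (satisfiable, forced_true_variables, number_of_rounds)."""
    true_vars, rounds = set(), 0
    while True:
        forced = {head for head, body in clauses
                  if head is not None and head not in true_vars
                  and body <= true_vars}
        if not forced:
            break
        true_vars |= forced                  # one parallel round
        rounds += 1
    # A purely negative clause is violated once its whole body is True.
    ok = all(head is not None or not body <= true_vars
             for head, body in clauses)
    return ok, true_vars, rounds

# (a) AND (b OR NOT a) AND (NOT a OR NOT b) is unsatisfiable:
print(horn_sat([("a", set()), ("b", {"a"}), (None, {"a", "b"})]))
```

    The number of rounds reported by such a sketch is exactly the quantity whose average-case behavior the paper studies.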

    Project for Developing Computer Science Agenda(s) for High-Performance Computing: An Organizer's Summary

    Designing a coherent agenda for the implementation of the High Performance Computing (HPC) program is a nontrivial technical challenge. Many computer science and engineering researchers in the area of HPC, who are affiliated with U.S. institutions, have been invited to contribute their agendas. We have made a considerable effort to give many in that research community the opportunity to write a position paper. This explains why we view the project as placing a mirror in front of the community, and hope that the mirror indeed reflects many of the opinions on the topic. The current paper is an organizer's summary and represents his reading of the position papers. This summary is his sole responsibility. It is respectfully submitted to the NSF. (Also cross-referenced as UMIACS-TR-94-129)

    Granularity of parallel memories

    Consider algorithms which are designed for shared memory models of parallel computation in which processors are allowed to have fairly unrestricted access patterns to the shared memory. General fast simulations of such algorithms by parallel machines in which the shared memory is organized in modules, where only one cell of each module can be accessed at a time, are proposed. The paper provides a comprehensive study of the problem. The solution involves three stages: (a) Before a simulation, randomly distribute the memory addresses among the memory modules. (b) Keep several copies of each address and assign memory requests of processors to the "right" copies at any time. (c) Satisfy these assigned memory requests according to the specifications of the parallel machine.
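
    A minimal sketch of stage (a) alone is given below: addresses are spread over modules by a salted hash standing in for the random distribution, and the congestion of the busiest module in one parallel step is measured. Stages (b) and (c), the multiple copies and the request scheduling, are not modeled; all names are illustrative.

```python
# Illustrative sketch of randomly distributing shared-memory addresses over
# memory modules; the salted hash is a stand-in for the paper's random
# distribution, and copies of addresses are not modeled.
import random

def simulate_step(num_modules, requests, seed=0):
    """requests: addresses issued by the processors in one parallel step.
    Returns the number of requests hitting the busiest module."""
    salt = random.Random(seed).getrandbits(32)
    load = [0] * num_modules
    for addr in requests:
        module = hash((salt, addr)) % num_modules   # random-looking placement
        load[module] += 1
    return max(load)

# 256 processors requesting 256 distinct addresses, spread over 64 modules:
# with random placement the busiest module typically sees only a handful.
print(simulate_step(64, list(range(256))))
```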

    An Immediate Concurrent Execution (ICE) Abstraction Proposal for Many-Cores

    Settling on a simple abstraction that programmers aim at, and that hardware and software systems people enable and support, is an important step towards convergence to a robust many-core platform. The current paper: (i) advocates incorporating a quest for the simplest possible abstraction in the debate on the future of many-core computers, (ii) suggests “immediate concurrent execution (ICE)” as a new abstraction, and (iii) argues that an XMT architecture is one possible demonstration of ICE providing an easy-to-program general-purpose many-core platform.

    Parallel unit propagation: Optimal speedup 3CNF Horn SAT

    A linear-work parallel algorithm for 3CNF Horn SAT is presented, which is interesting since the problem is P-complete.

    Can Parallel Algorithms Enhance Serial Implementation?

    Consider the serial emulation of a parallel algorithm. The thesis presented in this paper is rather broad. It suggests that such a serial emulation has the potential advantage of running on a serial machine faster than a standard serial algorithm for the same problem. The main concrete observation is very simple: just before the serial emulation of a round of the parallel algorithm begins, the whole list of memory addresses needed during this round is readily available, and we can start fetching all these addresses from secondary memories at this time. This permits prefetching the data that will be needed in the next "time window", perhaps by means of pipelining; these data will then be ready at the fast memories when requested by the CPU. The possibility of distributing memory addresses (or memory fetch units) at random over memory modules, as has been proposed in the context of implementing the parallel-random-access machine (PRAM) design space, is discussed. This work also suggests that a multi-stage effort to build a parallel machine may start with "parallel memories" and serial processing, deferring parallel processing to a later stage. The general approach has the following advantage: a user-friendly parallel programming language can be used already in its first stage. This is in contrast to a practice of compromising user-friendliness of parallel computer interfaces (i.e., parallel programming languages), and may offer a way for alleviating a so-called "parallel software crisis". It is too early to reach conclusions regarding the significance of the thesis of this paper. Preliminary experimental results with respect to the fundamental and practical problem of constructing suffix trees indicate that drastic improvements in running time might be possible. Serious attempts to follow it up are needed to determine its usefulness. Parts of this paper are intentionally written in an informal way, suppressing issues that will have to be resolved in the context of a concrete implementation. The intention is to stimulate debate and provoke suggestions and other specific approaches. Validity of our thesis would imply that a standard computer science curriculum, which prepares young graduates for a professional career of over forty years, will have to include the topic of parallel algorithms irrespective of whether (or when) parallel processing will succeed serial processing in the general purpose computing market. (Also cross-referenced as UMIACS-TR-91-145.1)
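
    The core observation can be made concrete with a small sketch. The Python code below illustrates round-by-round emulation with bulk prefetching under assumed toy interfaces (the names emulate_round and prefetch are hypothetical); it is not tied to any concrete memory system or to the paper's implementation.

```python
# Minimal sketch of the prefetching idea; the helper names are hypothetical.
# Each task in a round is an (addresses_needed, compute_fn) pair, and
# prefetch(addresses) models one bulk, possibly pipelined, fetch from slow
# storage into fast memory.
def emulate_round(tasks, prefetch):
    # 1. The full list of addresses the round needs is known up front.
    needed = set()
    for addresses, _ in tasks:
        needed.update(addresses)
    # 2. Fetch them all in one bulk request instead of stalling per access.
    fast_memory = prefetch(sorted(needed))
    # 3. Emulate the round's tasks serially, hitting only fast memory.
    return [fn([fast_memory[a] for a in addresses]) for addresses, fn in tasks]

# Toy usage: "slow memory" holds squares; prefetch copies the needed cells.
slow = {a: a * a for a in range(100)}
print(emulate_round(tasks=[([1, 2, 3], sum), ([2, 4], max)],
                    prefetch=lambda addrs: {a: slow[a] for a in addrs}))  # [14, 16]
```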

    XMTSim: A Simulator of the XMT Many-core Architecture

    This paper documents the features and the design of XMTSim, the cycle-accurate simulator of the Explicit Multi-Threading (XMT) computer architecture. XMT is a general-purpose many-core computing platform with the vision of a 1000-core chip that is easy to program but does not compromise on performance. XMTSim is a primary component in its publicly available toolchain, along with an optimizing compiler. Research and experimentation enabled by the toolchain played a central role in supporting the ease-of-programming and performance aspects of the XMT architecture. The compiler and the simulator are also important milestones for an efficient programmer's workflow from PRAM algorithms to programs that run on the shared-memory XMT hardware. This workflow is a key component in accomplishing the goal of ease-of-programming and performance. The applicability of the XMT simulator extends beyond specific XMT choices: it can be used by system researchers or by programmers to explore the much greater design space of shared-memory many-cores. As the toolchain can practically run on any computer, it provides a supportive environment for teaching parallel algorithmic thinking with a programming component.
    National Science Foundation grant CCF-081150

    Empirical Speedup Study of Truly Parallel Data Compression

    We present an empirical study of novel work-optimal parallel algorithms for Burrows-Wheeler compression and decompression of strings over a constant alphabet. To validate these theoretical algorithms, we implement them on the experimental XMT computing platform, developed at the University of Maryland especially for supporting parallel algorithms. We show speedups of up to 25x for compression, and 13x for decompression, versus bzip2, the de facto standard implementation of Burrows-Wheeler compression. Unlike existing approaches, which assign an entire (e.g., 900KB) block to a processor that processes the block serially, our approach is “truly parallel” as it processes the entire input in parallel. Besides the theoretical interest in solving the “right” problem, the importance of data compression speed for small inputs, even at great expense of quality (compressed size of data), is demonstrated by the introduction of Google’s Snappy for MapReduce. Perhaps surprisingly, we show the feasibility of holding on to quality while even beating Snappy on speed. In turn, this work adds new evidence in support of the XMT/PRAM thesis: that an XMT-like many-core hardware/software platform may be necessary for enabling general-purpose parallel computing. Comparison of our results to recently published work suggests a 70x improvement over what current commercial parallel hardware can achieve.
    NSF grants CCF-0811504 and CNS116185

    Parallel Algorithms for Burrows-Wheeler Compression and Decompression

    We present work-optimal PRAM algorithms for Burrows-Wheeler compression and decompression of strings over a constant alphabet. For a string of length n, the depth of the compression algorithm is O(log² n), and the depth of the corresponding decompression algorithm is O(log n). These appear to be the first polylogarithmic-time work-optimal parallel algorithms for any standard lossless compression scheme. The algorithms for the individual stages of compression and decompression may also be of independent interest: 1. a novel O(log n)-time, O(n)-work PRAM algorithm for Huffman decoding; 2. original insights into the stages of the BW compression and decompression problems, bringing out parallelism that was not readily apparent, allowing them to be mapped to elementary parallel routines that have O(log n)-time, O(n)-work solutions, such as: (i) prefix-sums problems with an appropriately-defined associative binary operator for several stages, and (ii) list ranking for the final stage of decompression.
    NSF grant CCF-081150
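
    As one example of the elementary routines mentioned in item 2, the sketch below shows list ranking by synchronous pointer jumping. It illustrates the routine generically and is not the paper's decompression stage; the parallel rounds are simulated sequentially in Python.

```python
# Illustrative sketch of list ranking by synchronous pointer jumping
# (Wyllie-style). On a PRAM every node would jump at once, giving O(log n)
# rounds; here each round is executed sequentially.
def list_ranking(nxt):
    """nxt[i] is the successor of node i; the tail points to itself.
    Returns rank[i] = number of hops from node i to the tail."""
    n = len(nxt)
    nxt = list(nxt)
    rank = [0 if nxt[i] == i else 1 for i in range(n)]
    changed = True
    while changed:                               # O(log n) rounds
        changed = False
        new_rank, new_nxt = list(rank), list(nxt)
        for i in range(n):                       # all nodes "in parallel"
            if nxt[i] != i:
                new_rank[i] = rank[i] + rank[nxt[i]]
                new_nxt[i] = nxt[nxt[i]]
                if new_nxt[i] != nxt[i]:
                    changed = True
        rank, nxt = new_rank, new_nxt
    return rank

# Linked list 0 -> 1 -> 2 -> 3 (tail): ranks are the distances to the tail.
print(list_ranking([1, 2, 3, 3]))    # [3, 2, 1, 0]
```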